• Background Genomic selection or genome-wide selection (GS) has been highlighted as a new approach for 
marker-assisted selection (MAS) in recent years. GS is a form of MAS that selects favourable individuals 
based on genomic estimated breeding values. Previous studies have suggested the utility of GS, especially for 
capturing small-effect quantitative trait loci, but GS has not become a popular methodology in the field of 
plant breeding, possibly because there is insufficient information available on GS for practical use. 

• Scope In this review, GS is discussed from a practical breeding viewpoint. Statistical approaches employed in 
GS are briefly described, before the recent progress in GS studies is surveyed. GS practices in plant breeding are 
then reviewed before future prospects are discussed. 

• Conclusions Statistical concepts used in GS are discussed with genetic models and variance decomposition, 
heritability, breeding value and linear model. Recent progress in GS studies is reviewed with a focus on empirical 
studies. For the practice of GS in plant breeding, several specific points are discussed, including linkage disequilibrium, 
features of populations and genotyped markers, and breeding schemes. Currently, GS is not perfect, but it is 
a potent, attractive and valuable approach for plant breeding. This method will be integrated into many practical 
breeding programmes in the near future with further advances and the maturing of its theory. 

Key words: Genomic selection, plant breeding, marker assisted selection, genetic model, linkage 
disequilibrium. 



INTRODUCTION 

Genomic selection or genome-wide selection (GS) has been highlighted 
as a new approach for marker-assisted selection (MAS) in 
recent years. GS is a form of MAS that selects favourable indivi- 
duals based on genomic estimated breeding values (GEBVs). 
Breeding values have not been a popular index in plant breeding, 
although they are frequently used in animal breeding. They 
are defined as 'the sum of the estimate of genetic deviation and 
the weighted sum of estimates of breed effects' (Van Vleck 
et al., 1992), and are predicted using phenotypic data from 
family pedigrees based on the additive infinitesimal model 
(Fisher, 1918). Several statistical approaches have been proposed 
for the prediction of estimated breeding values (EBVs), such 
as best linear unbiased prediction (BLUP) (Henderson, 1975) 
and a Bayesian framework (Gianola and Fernando, 1986). 
Furthermore, an innovative method for predicting breeding 
values based on genome-wide dense DNA markers was proposed, 
with the resulting values known as GEBVs (Meuwissen et al., 2001). When 
the idea of GEBV was proposed, it was regarded as an unrealistic 
approach because large-scale genotyping technologies were 
lacking at the time. However, it has become a feasible approach 
with recent advances in high-throughput genotyping platforms. 
The term 'GS' was first introduced by Haley and Visscher 
at the 6th World Congress on Genetics Applied to Livestock 
Production at Armidale, Australia in 1998 according to 
Meuwissen (2007), although it was not used in the main text of 
Meuwissen et al. (2001). However, the overall MAS programme 
using GEBV was later referred to as GS. 

The general processes of GS and traditional MAS used for 
quantitative traits (QTs) are shown in Fig. 1. The main frameworks 
of the two approaches are similar, where both GS and 
traditional MAS consist of training and breeding phases. In 
the training phase, phenotypes and genome-wide (GW) geno- 
types are investigated in a subset of a population, i.e. the train- 
ing population in GS and the mapping population in traditional 
MAS. Within populations, significant relationships between 
phenotypes and genotypes are predicted using statistical 
approaches. In the breeding phase, genotype data are obtained 
in a breeding population, before favourable individuals are 
selected based on the genotype data obtained. Three obvious 
differences between the two approaches are apparent: (1) in 
the training phase, quantitative trait loci (QTLs) are identified 
in traditional MAS while formulae for GEBV prediction are 
generated in GS, known as GS models; (2) in the breeding 
phase, genotype data are only required for targeted regions 
in traditional MAS, whereas GW genotype data are considered 
to be necessary in GS; (3) in the breeding phase, favourable 
individuals are selected based on the genotypes of markers in 
MAS, whereas GEBVs are used for selection in GS. Thus, GS 
jointly analyses all the genetic variance of each individual by 
summing the marker effects of GEBV (Heffner et al., 2009), 
and it is expected to address small-effect genes that cannot be 
captured by traditional MAS (Hayes et al., 2009). 

Fig. 1. Schemes of genomic selection (GS) (left) and traditional MAS for the selection of quantitative traits (right). Both GS and traditional MAS contain 
training and breeding phases. In the training phase, quantitative trait loci (QTLs) are identified in traditional MAS, whereas formulae for genomic estimated 
breeding value (GEBV) prediction, i.e. GS models, are generated in GS. In the breeding phase, favourable individuals are selected based on the genotypes of the selected markers in 
MAS, whereas GEBVs are used for selection in GS. 



Since GS was first proposed by Meuwissen et al. (2001), 
many reports have indicated the usefulness of GS in breeding 
for QTs. However, GS has still not become a popular 
methodology in the field of plant breeding. We consider that a 
major obstacle is the lack of sufficient information on 
GS for practical use. Indeed, most GS studies 
have dealt with statistics and simulations that are discussed 
in terms of formulae, which are often too specialized for breeders 
and molecular biologists to follow. To initiate further 
discussions on the applicability of GS in plant breeding, here our 
aim is to discuss GS from a practical breeding viewpoint. 
First, the statistical approaches used in GS are briefly 
explained to understand the essence of this approach. 
Second, we survey recent progress in GS studies from the 
areas of animal and plant science, mainly addressing those 
dealing with empirical data. Third, we describe several spe- 
cific factors that require careful consideration before prac- 
ticing GS in plant breeding. Finally, we discuss future 
prospects for the further advancement of GS and MAS pro- 
grammes overall. 



STATISTICAL CONCEPTS USED IN GS 

All GS, traditional MAS and pedigree-based phenotypic selec- 
tion (PS) methods are reliant on a common selection framework, 
i.e. finding a causal relationship between genetic factors and 
target traits based on putative genetic factors underlying the 
phenotypic distribution (in PS) or observed marker genotypes 
(in GS and traditional MAS) in a training population. Before 
describing the statistical approaches used for GEBV prediction, 
we briefly review the general statistical concepts that are com- 
monly used in PS, traditional MAS and GS. 



Genetic models and variance decomposition 

A genetic model of QTs is generally constructed based on 
an assumption that only effects caused by genetic factors 
are inherited across the generations. A simple but frequently 
used genetic model expresses the phenotypic value of an 
individual ($P$) as the sum of the genetic value ($G$) 
and the residual environmental effect ($E$): 

$$P = G + E$$ 

where the genetic value $G$ includes the additive genetic effect, 
dominance effect and epistasis. If we suppose that there is 
no correlation between $G$ and $E$ (i.e. no G × E effect), the 
covariance between $G$ and $E$ can be set at zero [$\mathrm{Cov}(G, E) = 0$]. 
The phenotypic variance, $V(P)$, is then expressed as the sum 
of the genetic variance, $V(G)$, and the environmental 
variance, $V(E)$: 

$$V(P) = V(G) + V(E) + 2\,\mathrm{Cov}(G, E) = V(G) + V(E).$$ 
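
The decomposition above can be checked numerically. The following is a minimal sketch, assuming simulated values of G and E drawn independently from normal distributions; the sample size, the standard deviations and the function names are illustrative assumptions and are not taken from any of the studies reviewed here.

```python
import random


def variance(xs):
    """Population variance (divides by n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def covariance(xs, ys):
    """Population covariance (divides by n)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 5000                # hypothetical number of individuals
G = [rng.gauss(0.0, 2.0) for _ in range(n)]  # assumed genetic values
E = [rng.gauss(0.0, 1.0) for _ in range(n)]  # assumed environmental effects
P = [g + e for g, e in zip(G, E)]            # P = G + E

v_p = variance(P)
cov_ge = covariance(G, E)
v_sum = variance(G) + variance(E) + 2.0 * cov_ge

print(f"V(P)                     = {v_p:.3f}")
print(f"V(G) + V(E) + 2Cov(G, E) = {v_sum:.3f}")
print(f"Cov(G, E)                = {cov_ge:.3f}")
```

The first two printed values agree up to rounding error, because the identity holds for any data set. The covariance term is only close to zero here because G and E were simulated independently, which is the assumption made in the text.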



Heritability 

Heritability is a measure of the degree to which 
the phenotypic characteristics of a population are transmitted to 
the next generation, and it is represented as the ratio of 
genetic variance to phenotypic variance. Broad-sense heritability 
($H^2$) focuses on the total genetic effects, $G$, including the 
additive, dominance and epistatic effects, whereas narrow-sense 
heritability ($h^2$) counts only additive genetic effects. 
Therefore, for $h^2$, the genetic model ($P = G + E$) can be 
rewritten using the additive genetic effect, $A$: 

$$P = A + E'$$ 

Here, $E'$ represents the residual effects that are not included in 
the additive genetic effect, $A$. Note that the dominance and 
epistatic effects are in $E'$. If we suppose that there is no correlation 
between $A$ and $E'$, then the phenotypic variance $V(P)$ can be 
broken down into the additive genetic variance, $V(A)$, and 
the residual effects variance, $V(E')$: 

$$V(P) = V(A) + V(E').$$ 

Because $h^2$ is defined as the ratio of $V(A)$ to $V(P)$, it is 
represented as follows: 

$$h^2 = V(A)/V(P).$$ 

In GS, $V(A)$ is broken down again into the variances explained 
by multiple DNA markers, $V(A_1), V(A_2), \ldots, V(A_M)$, under the 
assumption that DNA markers are not correlated with each 
other (Meuwissen et al., 2001). 



Breeding value (BV) 

The BV of an individual $i$ in a population is defined as 
follows based on the narrow-sense heritability, $h^2$: 

$$BV_i = m_0 + h^2 (y_i - m_0) = m_0 + (y_i - m_0)\,V(A)/V(P).$$ 

Here, $y_i$ is the phenotypic value of individual $i$, while $m_0$ is the 
mean phenotypic value of the population. Because $V(A)$ 
cannot be directly observed, $h^2$ has conventionally been 
estimated by comparing the phenotypic values of parents 
and their offspring. The BVs that are predicted based on an 
estimated heritability are known as EBVs. By contrast, the 
phenotypic value $y_i$ and $V(A)$ in GS are estimated based 
on the sum of the genotype effects of GW markers. Thus, 
the BV predicted in GS is known as the genomic EBV 
(GEBV). Note that the residual effect variance $V(E')$ is ignored 
in BV prediction, because narrow-sense heritability is 
employed. 
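
The BV formula can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the three phenotypic values and the variance components V(A) = 2 and V(P) = 4 are invented for illustration, and the function names are hypothetical.

```python
def narrow_sense_heritability(v_a, v_p):
    """h^2 = V(A) / V(P)."""
    if v_p <= 0:
        raise ValueError("phenotypic variance must be positive")
    return v_a / v_p


def breeding_values(phenotypes, v_a, v_p):
    """BV_i = m0 + h^2 * (y_i - m0), with m0 the population mean."""
    h2 = narrow_sense_heritability(v_a, v_p)
    m0 = sum(phenotypes) / len(phenotypes)
    return [m0 + h2 * (y - m0) for y in phenotypes]


# Invented example: three individuals, V(A) = 2 and V(P) = 4, so h^2 = 0.5.
phenotypes = [10.0, 12.0, 14.0]
bvs = breeding_values(phenotypes, v_a=2.0, v_p=4.0)
print(bvs)  # [11.0, 12.0, 13.0]
```

With a heritability of 0.5, each individual's deviation from the population mean is halved, which shows how a low heritability pulls the predicted BVs toward the mean.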



Linear model for marker effects 

In many implementations of GS, the causal relationship 
between the phenotype and genotype is represented as a 
linear model or its extension, which is then used to infer the 
GEBV of an individual in a breeding population. Thus, the 
linear model is a fundamental model employed in GS. Here, 
we assume there are $N$ individuals and $M$ bi-allelic markers 
in the training population, and we focus on one of the 
markers. Let $(y_i, x_{1i})$ denote the pair of the observed 
phenotype and genotype of the marker of the $i$th individual, i.e. 
$(y_1, x_{11}), (y_2, x_{12}), \ldots, (y_N, x_{1N})$. In addition, let us suppose 
that the bi-allelic genotypes are encoded by 0 and 1, 
respectively, and that the phenotypes of the $N$ individuals are 
distributed as shown in Fig. 2. Because an individual gains additional 
phenotypic value, $\beta_1$, depending on its marker genotype, the 
phenotype can be modelled as follows: 



$$y_i = \beta_0 + x_{1i}\beta_1 + e_i \qquad (1)$$ 



where $\beta_0$ and $\beta_1$ are the parameters to be determined, and $e_i$ is 
an error term that is usually assumed to have a normal 
distribution with a mean of zero. This model is represented as a 
linear combination of the terms, known as a 'linear model', 
showing that the phenotypes of individuals with genotypes 0 
and 1 are normally distributed around $\beta_0$ and $\beta_0 + \beta_1$, 
respectively. The parameters of a linear model may be determined by 
least-squares estimation, such that the summation of $e_i^2$, i.e. an 
error function $E = \sum_{i} (y_i - \beta_0 - x_{1i}\beta_1)^2$, is minimized and the 
line is fitted to the phenotype. 

Fig. 2. Relationships between marker genotypes ($x_{1i}$: 0 and 1; horizontal axis) and 
phenotypes ($y_i$) of the individuals (open circles) in a training population. If the 
marker genotype is correlated with the phenotype, segregation is modelled 
using the bold line ($y_i = \beta_0 + x_{1i}\beta_1$, where $\beta_0$ and $\beta_1$ are parameters to be 
determined). 

The linear model (1) represents 
the relationship between the genotype and phenotype for a 
single marker, but it can be extended to include all the $M$ 
markers as follows: 

$$y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \cdots + x_{Mi}\beta_M + e_i = \sum_{j=0}^{M} x_{ji}\beta_j + e_i \qquad (2)$$ 

where $x_{ji}$ is the genotype of the $j$th marker in the $i$th individual, the 
coefficient $\beta_j$ is its effect on the phenotype, and $x_{0i} = 1$ is a 
dummy variable. Similarly, the coefficients are determined 
by minimizing an error function, $E = \sum_{i=1}^{N} \left(y_i - \sum_{j=0}^{M} x_{ji}\beta_j\right)^2$. 
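
For the single-marker model (1), the least-squares estimates have a simple closed form. The sketch below assumes six invented individuals with genotypes coded 0 and 1; the function name and the data are hypothetical and serve only to show that the fitted intercept is the mean phenotype of genotype 0 and the fitted slope is the difference between the two genotype means.

```python
def fit_single_marker(genotypes, phenotypes):
    """Least-squares estimates (beta0, beta1) for y_i = beta0 + x_1i * beta1 + e_i."""
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:
        raise ValueError("marker is monomorphic; beta1 cannot be estimated")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Invented training data: genotype 0 has mean phenotype 5, genotype 1 has mean 8.
genotypes = [0, 0, 0, 1, 1, 1]
phenotypes = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
beta0, beta1 = fit_single_marker(genotypes, phenotypes)
print(beta0, beta1)  # 5.0 3.0
```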

Because GW genotype data are used in GS, a problem often 
arises, known as 'large p, small n problem', when the linear 
model (2) is employed for GEBV prediction (p and n are 
the numbers of markers and individuals, respectively). That 
is, a linear model that consists of p markers is too complicated 
for the prediction of BVs of n individuals. Thus, it can cause 
over-fitting, and the linear model only works well in the 
training population. To avoid over-fitting, a penalty term is 
introduced in the error function, i.e. 
$E = \sum_{i=1}^{N} \left(y_i - \sum_{j=0}^{M} x_{ji}\beta_j\right)^2 + \lambda \sum_{j=0}^{M} |\beta_j|^q$, 
where $\lambda$ is a parameter that controls the 
effect of the penalty term. Note that a large $|\beta_j|$ inhibits 
the minimization of the error function. Setting $q = 1$ and $q = 2$ 
gives LASSO (least absolute shrinkage and selection 
operator) and ridge regression (RR in Table 2), respectively. 
Ridge regression forces all the coefficients to shrink toward 
zero equally, while LASSO can set several coefficients that 
are unrelated to the phenotype to zero. Therefore, if the pheno- 
type is controlled by many markers with small effects, ridge 
regression will capture those effects (Heffner et al., 2009), 
whereas LASSO will capture large effects with a small 
number of markers. If the coefficients of the markers are set 
to zero or a low value in the training phase, they are excluded 
from the model and their genotype information is not required 
during the breeding phase. 
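
The contrast between the two penalties can be made concrete with a small sketch. It rests on several assumptions that are ours rather than those of the cited studies: the error function is minimized by coordinate descent, the intercept is left unpenalized, and the genotypes, phenotypes and penalty weight are invented. The function names `fit_penalized` and `predict_gebv` are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def fit_penalized(X, y, lam, q, n_iter=200):
    """Coordinate descent for sum_i (y_i - b0 - sum_j x_ji b_j)^2 + lam * sum_j |b_j|^q.

    X is a list of individuals, each a list of M marker genotypes (0/1).
    q = 1 gives LASSO and q = 2 gives ridge regression.
    Assumption of this sketch: the intercept b0 is not penalized.
    """
    if q not in (1, 2):
        raise ValueError("q must be 1 (LASSO) or 2 (ridge)")
    n, m = len(X), len(X[0])
    beta0, beta = sum(y) / n, [0.0] * m
    for _ in range(n_iter):
        # Intercept update: mean of the residuals left by the marker terms.
        beta0 = sum(y[i] - sum(X[i][k] * beta[k] for k in range(m))
                    for i in range(n)) / n
        for j in range(m):
            rho = z = 0.0
            for i in range(n):
                others = beta0 + sum(X[i][k] * beta[k] for k in range(m) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            if q == 1:
                beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
            else:
                beta[j] = rho / (z + lam) if z + lam > 0 else 0.0
    return beta0, beta


def predict_gebv(genotypes, beta0, beta):
    """GEBV of one individual: intercept plus the sum of its marker effects."""
    return beta0 + sum(x * b for x, b in zip(genotypes, beta))


# Invented training population: six individuals, two markers.
X = [[0, 0], [0, 1], [0, 0], [1, 1], [1, 0], [1, 1]]
y = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
for name, q in (("LASSO", 1), ("ridge", 2)):
    b0, b = fit_penalized(X, y, lam=3.0, q=q)
    gebv = predict_gebv([1, 0], b0, b)  # hypothetical breeding-phase genotype
    print(name, round(b0, 3), [round(v, 3) for v in b], round(gebv, 3))
```

In the LASSO update, a marker whose association with the residual is weaker than the threshold receives a coefficient of exactly zero, whereas the ridge update only divides the estimate by a larger denominator. This is the mechanism behind the statement that LASSO can exclude markers from the model while ridge regression shrinks all of them.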

For example, 'least-square estimation' and 'BLUP estima- 
tion' for effects of markers or chromosome segments in 
Meuwissen et al. (2001) adopt similar linear models. Here, 
BLUP stands for best linear unbiased prediction of a 
parameter. As Heffner et al. (2009) summarize, the methods 
using ridge regression assume that the effects of markers have 
an equal variance. On the other hand, the Bayesian methods 
known as BayesA and BayesB of Meuwissen et al. 
(2001) relax this assumption and estimate the variances 
of the effects of markers separately. In a Bayesian framework, 
the effect of a marker is represented by the distribution of a random 
variable that is determined by its prior distribution according 
to some assumptions. BayesA and BayesB adopt 
different prior distributions for the variance of the effects of 
markers; that of the latter is defined to allow a proportion of 
markers to have no effect on a phenotypic value. Although 
simultaneous evaluation of markers and no need for marker 
selection are advantageous characteristics of GS, decreasing 
the number of markers required in the breeding phase might 
be preferable from the economic viewpoint. 

RECENT PROGRESS IN GS STUDIES 

The most important factor determining the success of GS is the 
accurate prediction of GEBVs. The accuracy of the predicted 
GEBVs is often estimated based on the correlation between 
the observed phenotypic values and GEBVs. To produce accurate 
GEBVs, several studies have compared statistical 
approaches to GEBV prediction. In addition, simulation 
studies have been widely used to investigate the effects of the 
number of QTLs, markers, individuals and other variables. 
These studies were reviewed recently by Heffner et al. 
(2009) and Jannink et al. (2010), and so are not described 
further in this section. Instead, we focus on recent progress 
in GS based on empirical data to understand better the 
practical use of GS. 
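
The accuracy measure described above, i.e. the correlation between observed phenotypic values and GEBVs, can be computed as follows. This is a minimal sketch that assumes Pearson's correlation coefficient is the measure used; the four observed values and GEBVs are invented.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented validation data: observed phenotypes and predicted GEBVs.
observed = [1.0, 2.0, 3.0, 4.0]
gebvs = [1.0, 3.0, 2.0, 4.0]
accuracy = pearson_correlation(observed, gebvs)
print(round(accuracy, 3))  # 0.8
```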



Animal science 

Studies of GS are more common in the field of animal 
science than plant science. The BV concept was used in 
animal breeding long before the emergence of GS, so the 
GS approach was more readily accepted by animal scientists. 
In addition, the lower diversity of the targeted species and 
fewer effects of environmental factors during the growing 
stage might have contributed to the rapid introduction of GS 
in animal science. The first empirical GS study in animal 
science was reported by Legarra et al. (2008) using mice 
(Table 1). A total of 1884 individuals were generated from 
eight inbred lines and genotyped using 10 946 single nucleotide 
polymorphism (SNP) markers, before predicting the 
GEBVs for four traits related to body size. A comparison of 
the predictive ability and accuracy of GEBVs generated with 
or without SNP genotypes and polygenic effects 
demonstrated that GW genetic evaluation and selection provided 
better accuracy and predictive ability than the classical 
polygenic model. 

The most advanced progress in GS has been observed in 
dairy cattle. In Table 1, the results of three GS studies in 
dairy cattle are summarized (Hayes et al., 2009; Luan et al., 
2009; Van Raden et al., 2009). In addition to the three 
reports in Table 1, seven empirical GS studies of dairy cattle 
were also reported and reviewed by Hayes et al. (2009), 
Moser et al. (2009) and Calus (2010). In the three cattle 
studies in Table 1, a total of 500-5335 individuals were 
used for GEBV prediction using 18 991-38 416 SNPs. 
GEBVs for various QTs related to milk production, cattle 
body size and fertility were predicted using several different 
methods, where the accuracy of GEBVs ranged from 0.14 to 
0.69. Rolf et al. (2010) and Mujibi et al. (2011) reported 
GEBV prediction in beef cattle. Parentally identified steers 
and sires of 2405 Angus cattle were genotyped using 41 028 
SNPs in a study by Rolf et al. (2010), while an admixture 
population consisting of Angus, Charolais and hybrid bulls 
was genotyped using 37 959 SNPs for 721 individuals in a 
study by Mujibi et al. (2011). GEBVs for traits related to 
daily gain and daily intake were investigated, and the estimated 
accuracies ranged from −0.07 to 0.48. In chickens, Wolc et al. 
(2011) tested 16 traits related to eggs and chicken body size 
with 23 356 SNP genotypes using 2708 individuals derived 
from a single brown-egg layer line. The accuracy of GEBVs 
estimated ranged from 0.2 to 0.7. In addition, Calenge et al. 
(2011) reported GS studies on Salmonella carrier-state 
resistance in chickens (not shown in Table 1). 



Table 1. Features of test populations, number of genotyped SNPs and ranges of GEBV accuracy in empirical animal GS studies 
Only studies that investigated the accuracy of GEBVs based on the correlation between observed phenotypic values and GEBVs are listed. 
a Number of individuals used for GEBV prediction (training population) versus that used for validation (validating population). 
b Correlation between observed phenotypic values and GEBVs. 
c Models with the highest or higher accuracy of GEBVs when multiple methods were used for GEBV prediction. G-BLUP, genomic best linear unbiased prediction; RR-BLUP, random regression best linear unbiased prediction. 
d Random masking. 
e Cohort masking. 
f Across families. 
g Within families. 
h Data cited from Figs 1 and 2. 

The populations used in the empirical studies mentioned 
above were usually divided into two, i.e. training and validating 
populations. Training populations were used to develop GS 
models based on genotypic and phenotypic data, whereas 
the validating populations were used for investigating the 
GEBV accuracy by estimating the correlation between the 
GEBVs predicted by the GS models and the observed pheno- 
typic values. Validation is not theoretically essential for a GS 
scheme (Fig. 1), although it is practically important to confirm 
the adequacy of a GS model before moving onto the breeding 
phase. Of the seven studies listed in Table 1, five considered 
pedigree relationships when the populations were divided 
into training and validating populations (Hayes et al., 2009; 
Luan et al., 2009; Van Raden et al., 2010; Rolf et al., 2010; 
Wolc et al., 2011). Thus, these studies reflected the entire 
GS process better compared with the others, because the 
breeding phases in GS were demonstrated virtually by the veri- 
fication of GS models using the progeny of the training 
populations. 
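
A pedigree-aware division of this kind can be sketched as follows. The record layout (an `id` and a `generation` field) and the function name are hypothetical; the sketch only illustrates the idea of training on an earlier generation and validating on its progeny, and it does not reproduce the procedure of any specific study in Table 1.

```python
def split_by_generation(individuals, last_training_generation):
    """Put earlier generations in the training set and later ones in validation."""
    training = [ind for ind in individuals
                if ind["generation"] <= last_training_generation]
    validation = [ind for ind in individuals
                  if ind["generation"] > last_training_generation]
    return training, validation


# Invented records: two parents (generation 0) and two progeny (generation 1).
individuals = [
    {"id": "parent_1", "generation": 0},
    {"id": "parent_2", "generation": 0},
    {"id": "progeny_1", "generation": 1},
    {"id": "progeny_2", "generation": 1},
]
training, validation = split_by_generation(individuals, last_training_generation=0)
print([ind["id"] for ind in training])    # ['parent_1', 'parent_2']
print([ind["id"] for ind in validation])  # ['progeny_1', 'progeny_2']
```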

The reported studies used different materials and statistical 
methods for GEBV prediction, but many of these studies 
showed that the accuracy of GEBV was higher than that of 
traditional EBV and it was increased with a larger population 
size, larger numbers of genotyped SNPs, and higher heritabil- 
ity of the targeted traits. The details are not described here, but 
some of the studies compared different statistical methods for 
GEBV prediction. Note that the best approaches with the 
highest accuracy of GEBVs were different in each case 
(Table 1). The accuracy of GEBVs estimated in empirical 
studies fell below 0.7 (Table 1), which was lower than that 
suggested by many simulation studies, such as 0.85 in 
Meuwissen et al. (2001). Calus (2010) indicated that the 
distribution of QTL effects in real data is generally lower than that 
assumed in simulation studies. If this is true, the lower 
accuracy estimated by real data might be affected by a lower number 
of QTL effects as well as other factors, such as the non-additive 
effects of QTLs and environmental factors. 

Plant science 

Plant breeding targets a diversity of species with different 
reproduction systems, generation times, genome structures 
and utilized organs. Thus, various methods are used in conven- 
tional breeding, i.e. PS and traditional MAS, to adapt to the 
demands of different targeted species and breeding objectives. 
Like conventional breeding, GS should be adapted to fit 
different types of plant species and breeding objectives. 

Reports on plant species that specified 'genomic selection' 
or 'genomewide selection' have been published since 2007. 
Piyasatian et al. (2007) simulated the efficiency of GS in a 
cross of inbred lines, which is common in plant breeding but 
not in animal breeding. However, no specific plant species 
was considered as the targeted species in this paper. 
Simulation studies of specific species were first published 
for maize (Bernardo and Yu, 2007), where a comparison 
between GS and marker-assisted recurrent selection (MARS) 
was demonstrated for three cycles of selection of doubled 
haploid lines (DHLs). The response of GS was 18-43 % 
greater than that of MARS with different numbers of QTLs 
(20, 40 and 100). Moreover, simulation studies using maize 
were performed to determine the advantages of using DHLs 
compared with F2 populations in GS and MARS (Mayor and 
Bernardo, 2009), and to develop a methodology for the rapid 
introgression of exotic germplasm into an adapted line of 
maize via GS (Bernardo, 2009). In addition to maize, two 
GS simulations were performed with the oil palm, which is 
an outcrossing species that requires 19 years for one cycle of 
phenotypic selection (PS) (Wong and Bernardo, 2008), and with a self-pollinated 
crop, barley (Bernardo, 2010). 

While these studies simulated biparental cross populations, 
three studies also reported GS simulation using multiple 
inbred lines in barley based on real genotype data obtained 
mainly from SNPs and diversity array technology (DArT) 
(Zhong et al., 2009; Jannink, 2010; Iwata and Jannink, 
2011). Zhong et al. (2009) compared the accuracy of four 
GS prediction methods as affected by marker density, 
level of linkage disequilibrium (LD), QTL number, 
sample size and level of replication in populations 
generated from 42 inbred lines of two-row 
spring barley, with the genotypes of 1933 loci obtained from 
SNP, DArT and classical markers. They concluded that the 
GS prediction method with the highest accuracy changed 
with different levels of LD between the markers and QTLs, 
QTL effects, and generations of individuals. Moreover, Iwata 
and Jannink (2011) simulated the accuracy of GS using 
larger-scale data, consisting of 1325 SNPs in 863 breeding 
lines of barley derived from nine breeding programmes in 
the USA. Seven methods were used for GEBV prediction, 
and the mean of the predictions from all methods was more accurate 
than predictions based on any single method under medium 
and high heritability. Jannink (2010) simulated the dynamics 
of long-term GS using 192 breeding lines from an elite 
six-row spring barley programme with genotypes identified 
by 983 polymorphic markers. The results suggested that 
losing favourable alleles with weak LD with markers during 
selection cycles was inevitable, while placing additional 
weight on low-frequency favourable alleles was important 
for long-term GS. 

Investigations of the accuracy of GEBV predictions using em- 
pirical data have been reported for maize, barley, wheat and 
Arabidopsis thaliana (Table 2). It was first demonstrated by 
Lorenzana and Bernardo (2009) for maize, A. thaliana and 
barley. All the test populations were generated from biparental 
crosses where the number of test progeny and markers ranged 
from 119 to 415 and 69 to 1339, respectively. Arabidopsis thali- 
ana had the highest accuracy of GEBVs, although the number of 
polymorphic markers used for genotyping was the lowest. This 
study was followed by demonstrations of GS using empirical 
data in maize by Piepho (2009), Crossa et al. (2010) and Guo 
et al. (2011), as shown in Table 2. Piepho (2009) compared 
the performance of nine models using a series of experiments 
with DHLs derived from a single cross conducted in five envir- 
onments, and suggested the need to appropriately model geno- 
type-environment interactions and to employ an independent 
estimate of error. Crossa et al. (2010) demonstrated GS using a 
genetically diverse population [300 lines bred in CIMMYT 
(The International Maize and Wheat Improvement Center)] 
and 1148 SNPs, with a predicted accuracy of GEBVs ranging 
from 0.42 to 0.79 by ridge regression BLUP. The largest-scale 
analysis of maize was performed by Guo et al. (2011), who 
used 4699 progeny derived from 25 nested association mapping 
populations with genotypes for 1106 SNPs. While a common 
line, 'B73', was used as the maternal line across the 25 
mapping populations, the paternal lines were all different. 
Interestingly, the accuracy of the predicted GEBVs differed 
among the 25 crosses, although the study used almost the same SNPs, 
targeted traits and population sizes. 

GS studies using empirical data from wheat were first 
reported by Crossa et al. (2010) using 1279 DArT genotypes 
and 599 wheat lines bred in CIMMYT. The targeted trait 
was grain yield, and the accuracy of GEBVs predicted by reproducing kernel 
Hilbert space regression ranged from 0.48 to 0.61. In addition, 
Heffner et al. (2011) reported empirical results for 
wheat using 209 (CC population) and 174 (FKQ population) 
progeny of DHLs of biparental crosses with 399 and 574 
polymorphic genotypes, respectively. The accuracy of GEBVs in 
the CC and FKQ populations ranged from 0.32 to 0.84 and 
0.41 to 0.73, respectively (RR-BLUP, sample size was 96). 

GS of perennial crops is considered to be more effective 
than that of annual crops because of their long generation times. 
GEBV predictions based on empirical data were presented 
for Loblolly pine and eucalyptus at the IUFRO Tree 
Biotechnology Conference 2011 (Table 2; Grattapaglia et al., 
2011; Isik et al., 2011; Resende et al., 2011). All cases used 
full-sib families as test populations and the number of 
individuals ranged from 149 to 920. In the two studies of Loblolly 
pine, 3406-3938 SNP markers were used for genotyping, 
while 3120-3564 DArT markers were used in the study of 
eucalyptus. The GEBV accuracy across all studies ranged from 0.3 to 
0.77. 

Interestingly, the ranges of accuracies in empirical studies 
were higher in plant studies than animal studies, although 
most plant studies employed lower numbers of genotyping 
markers. This might be due to the lower genetic diversity 
caused by a small number of parental lines and a greater 
bottleneck in the breeding materials. Note that the numbers of 
markers used for woody species were higher than those used 
for annual plant species. Empirical plant GS studies show 
that GS is a potential method for plant breeding and that it 
can be performed with realistic sizes of populations and 
markers when the populations used are carefully chosen. 

THE PRACTICE OF GS IN PLANT BREEDING 

Linkage disequilibrium (LD) 

LD has a major effect on the operability of GS, so it has to be 
well understood before performing GS. LD is defined as the 
non-random association of alleles at different loci (Williams 
and Cummings, 1997). The intensity of LD between two loci 
is measured based on the frequency of alleles, using indexes 
such as $D$, $D'$ and $r^2$, and it ranges from completely random 
($|D| = |D'| = r^2 = 0$) to complete LD ($|D| = 0.25$, $|D'| = 
r^2 = 1$) (Gaut and Long, 2003). The LD intensity decays 
with greater distance between two markers. Although it is 
difficult to delineate, a significant LD intensity is commonly 
considered to be $r^2 > 0.1$ (Remington et al., 2001; Garris et al., 
2003; Palaisa et al., 2003). In general, the distance between 
two markers with significant LD intensity ($r^2 > 0.1$) is 
found to be shorter in outcrossing species than in selfing 
species, although it varies with different species, population 
structure and genome regions (Gupta et al., 2005). For 
example, observed marker intervals with significant LD 
intensity in outcrossing species are reported to be 100-150 bp in 
Loblolly pine, >500 bp in grape and 0.4-7.0 kbp in maize, 
whereas those in selfing species are >50 kbp in soybean, 
100 kbp in rice and 250 kbp in A. thaliana (reviewed by 
Gupta et al., 2005). 
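
The LD indexes mentioned above can be computed from haplotype and allele frequencies using their standard definitions. The sketch below is illustrative: the function name and the example frequencies are assumptions, and the sign of D' is kept (negative when the two alleles are in repulsion), so its absolute value corresponds to the |D'| used in the text.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two bi-allelic loci.

    p_ab is the frequency of haplotypes carrying allele A at the first locus
    and allele B at the second; p_a and p_b are the two allele frequencies.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2


# Complete LD: allele A is always found together with allele B.
print(ld_statistics(0.5, 0.5, 0.5))   # (0.25, 1.0, 1.0)
# Random association: haplotype frequency equals the product of allele frequencies.
print(ld_statistics(0.25, 0.5, 0.5))  # (0.0, 0.0, 0.0)
```

A marker pair would be counted as being in significant LD under the threshold quoted above when the third returned value exceeds 0.1.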

The number of markers required for GS modelling is deter- 
mined based on the marker interval with a significant LD 
intensity in targeted populations. In the case of Loblolly pine, 
the genome size exceeds 20 Gbp (Wakamiya et al., 1993) 
and the marker interval with a significant LD intensity was 
between 100 and 150 bp (Gonzalez-Martinez et al., 2004) in 
435 unrelated individuals. If the 435 individuals were used 
for GS modelling, the number of markers required would be 
at least 200 M (20 Gbp per 100 bp). However, significant 
GEBVs with 0.3-0.83 accuracy were obtained using 3406-3938 
SNPs in full-sib families of Loblolly pine (Table 2; 
Isik et al., 2011; Resende et al., 2011). This large disparity 
in the number of required markers is caused by the different 
length in the marker interval with a significant LD intensity 
in an unrelated mapping population and full-sib families. In 
other words, employing a population that originated from a 
few parental lines is effective in reducing the number of 
markers required, especially for species whose LD intensities 
decay rapidly among unrelated individuals (see Fig. 3). 
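
The arithmetic behind the Loblolly pine example can be written out as a short sketch. It assumes, as the text does, that one marker is needed per interval over which LD remains significant; the function name is hypothetical, and the genome size of 20 Gbp is the lower bound quoted above.

```python
def markers_required(genome_size_bp, ld_interval_bp):
    """Number of evenly spaced markers needed to place one in every LD interval."""
    if ld_interval_bp <= 0:
        raise ValueError("LD interval must be positive")
    return -(-genome_size_bp // ld_interval_bp)  # ceiling division on integers


genome_size_bp = 20_000_000_000  # 20 Gbp, lower bound for Loblolly pine
for interval_bp in (100, 150):   # LD interval reported for unrelated individuals
    print(interval_bp, markers_required(genome_size_bp, interval_bp))
# 100 200000000
# 150 133333334
```

Both figures are several orders of magnitude above the 3406-3938 SNPs used in the full-sib families, which is the disparity discussed in this paragraph.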

Relationship between training and breeding populations 

In traditional MAS, a marker that is confirmed to have tight 
linkage with a target QTL or gene can be used as a selection 
marker in most breeding populations of that species. 
Therefore, breeders have not had to seriously consider the re- 
lationship between mapping populations and breeding popula- 
tions. However, in GS, the relationship between training and 
breeding populations must be carefully considered with the 
single exception of a marker set where adjacent markers 
have significant LD intensities across unrelated individuals in 
a pool of breeding materials genotyped for the training 
populations. 

Suppose that two pairs of lines used for biparental crosses 
are selected from a pool of breeding materials (Fig. 4). The 
genotypes of the flanking markers (II and IV) of a targeted 
gene/allele (yellow-coloured G) are 'white' in cross 1, while 
those in cross 2 are 'black'. This indicates that allele types 
with significant LD with the targeted genes are not kept 
across different crosses. When this happens in traditional 
MAS, we usually have to explore the markers nearest to the 
targeted genes to avoid false positive selection. However, 
because GW markers are used in GS, it is almost impossible 
or meaningless to explore the nearest markers to each GW 
marker. Thus, establishing a GS model based on a training 
population does not work in a breeding population if the 
genetic structures of both populations are different, except 
for the case described in the preceding paragraph. Indeed, in 
most reported GS studies, the training populations were 
assumed to consist of ancestors or randomly selected indivi- 
duals in a breeding population. Harris et al. (2008) reported 
that SNP estimates calculated from a Holstein-Friesian 



Table 2. Features of test populations, number of genotyped loci, and ranges of GEBV accuracy investigated in empirical plant GS studies

| Species | Population type | Size of population used | Training population ratio* | No. of genotyped markers† | Accuracy of GEBVs‡ | GEBV prediction¶ | Traits | Reference |
|---|---|---|---|---|---|---|---|---|
| Maize | RILs derived from single cross | 223 | 0.43, 0.65, 0.80 | 1339 SSRs and RFLPs | 0.48-0.73 | BLUP | 8 morphological traits, 3 chemical components, grain moisture | Lorenzana and Bernardo (2009) |
| Maize | RILs derived from single cross | 119 | 0.80 | 1339 SSRs and RFLPs | 0.40-0.50 | BLUP | 5 morphological traits, grain moisture | |
| Maize | F2 derived from single cross | 349 | 0.08, 0.13, 0.20 | 160 SSRs | 0.59-0.72 | BLUP | 3 morphological traits, grain moisture | |
| Maize | Testcrosses of DHLs | 371 | 0.13, 0.26, 0.32 | 125 SNPs | 0.31-0.55 | BLUP | 3 morphological traits, grain moisture | |
| Arabidopsis thaliana | RILs derived from single cross | 415 | 0.12, 0.23, 0.29, 0.32 | 69 SSRs | 0.90-0.93 | BLUP | Flowering time, dry matter, free amino acids | |
| Barley | DHLs derived from single cross | 150 | 0.36, 0.64, 0.80 | 223 RFLPs | 0.64-0.83 | BLUP | Plant height, grain yield, 3 chemical components | |
| Barley | DHLs derived from single cross | 140 | 0.34, 0.69, 0.80 | 107 RFLPs and AFLPs | 0.66-0.85 | BLUP | Plant height, two chemical components | |
| Maize | DHLs derived from single cross | 208 | 1.00 | 136 SNPs and SSRs | 1.00§ | RR, POW, EXP, GAU, SPH | Kernel dry weight | Piepho (2009) |
| Wheat | Lines bred in CIMMYT | 599 | 0.10 | 1279 DArTs | 0.48-0.61 | PM-RKHS | Grain yield | Crossa et al. (2010) |
| Maize | Lines bred in CIMMYT | 300 | 0.90 | 1148 SNPs | 0.42-0.79 | M-BL | Grain yield, female flowering, male flowering, anthesis-silking interval | |
| Wheat | DHLs derived from single cross | 209 | 0.11, 0.23, 0.46 | 399 SSRs, DArTs, AFLPs, TRAPs, STS | 0.32-0.84 | RR-BLUP | 8 grain quality | Heffner et al. (2011) |
| Wheat | DHLs derived from single cross | 174 | 0.14, 0.28, 0.55 | 574 DArTs | 0.41-0.73 | RR-BLUP | 8 grain quality | |
| Maize | 25 nested association mapping populations | (126-196) x 25 populations | 0.20, 0.40, 0.60, 0.80 | 1106 SNPs | 0.26-0.57 | RR-BLUP | Three flowering traits | Guo et al. (2011) |
| Loblolly pine | 61 full-sib families derived from 32 parents | 790-840 | Not shown | 3938 SNPs | 0.64-0.77 | BLUP | Diameter at breast height, total height | Resende et al. (2011) |
| Loblolly pine | Full-sib offspring | 149 | Not shown | 3406 SNPs | 0.3-0.83 | Pedigree model | Growth and quality traits | Isik et al. (2011) |
| Eucalyptus | 43 full-sib family, 11 interspecific hybrids | 783 | 0.90 | 3120 DArTs | 0.53-0.69 | BLUP | Height, diameter at breast height, wood density, pulp yield, lignin content | Grattapaglia et al. (2011) |
| Eucalyptus | 75 full-sib family, 55 elite parents (hybrids) | 920 | 0.90 | 3564 DArTs | 0.54-0.62 | BLUP | Puccinia rust resistance | |

* Percentages of number of individuals in training populations to whole populations.

† SSRs, simple sequence repeat markers; RFLPs, restriction fragment length polymorphism markers; SNPs, single nucleotide polymorphism markers; DArTs, diversity array technology markers; AFLPs, amplified fragment length polymorphism markers; TRAPs, target region amplification polymorphism markers; STS, sequence tagged site marker.

‡ Correlation between observed phenotypic values and GEBVs.

§ Correlation between adjusted mean and GEBVs. Error variance is not fixed.

¶ Models with the highest or higher accuracy of GEBVs when multiple methods were used for GEBV prediction. BLUP, best linear unbiased prediction. Spatial models using: POW, power; EXP, exponential; GAU, Gaussian; SPH, spherical models. PM-RKHS, pedigree plus molecular marker model using reproducing kernel Hilbert space regression. M-BL, regression model using the Bayesian LASSO; RR, ridge regression.
Fig. 3. Variation of LD intensity in different populations of a single species. (A) Allele frequency and LD indices (r²) between marker I and others in an
unrelated population. Roman numerals represent markers mapped on a linkage group with 20-cM intervals. The two allele types, white and black, are represented
in white and black. White allele freq. means the frequency of white alleles for markers II-V, in each case where the marker I allele is white or black. In this
example, the white allele frequencies of markers II, III, IV and V are all 0.5, while the LD indices (r²) between marker I and other markers are all zero
(completely random). (B) A population of clonally propagated individuals. Assume that an individual is selected from an unrelated population (outlined in blue in
population 'A') and clonally propagated. All individuals in population 'B' share the same genotype. Thus, the r² values between marker I and the other markers
are all 1.0 (complete LD). (C) Suppose two individuals are selected from population 'A' (outlined in blue and red) and RILs (recombinant inbred lines) are
developed based on a cross between the two individuals. Recombination occurs during meiotic division in the F1, so the white allele frequency varies depending
on the distances between marker I and other markers. Then, LD decay is observed in the RILs.



training population did not produce accurate GEBVs in a 
Jersey population. Toosi et al. (2009) simulated the accuracy 
of GEBVs in admixed and cross-bred livestock populations, 
and found that the accuracy was greatly reduced when genes 
from the target pure breed were not included in the admixed 
and cross-bred population. 

Population size 

Several reports of simulation and empirical GS studies
suggest that a larger training population size improves the
accuracy of GEBV predictions. For example, Heffner et al.
(2011) reported that the average ratios of GS accuracy to PS
accuracy for grain quality traits in biparental wheat populations
containing 174 or 209 individuals were 0.66, 0.54 and 0.42
for training population sizes of 96, 48 and 24, respectively.
The ratio of the number of individuals in the training to the 



breeding populations varied in different studies. For 
example, it ranged from 0.08 to 1.00 in empirical studies of
plants (Table 2). Although the appropriate ratio varied depend- 
ing on the genetic diversity, population size, heritability of 
traits and the number of QTLs, it can be suggested that a 
higher training : breeding population ratio is required with 
greater genetic diversity, smaller-sized breeding populations, 
lower heritability of traits and larger numbers of existing 
QTLs to obtain GEBVs with high accuracy. In addition, the 
balance of the population size and the genotyped marker is 
also important. When the population size is small and the 
genotype data are large, this often causes an overestimation 
of the genotype effect, which exaggerates minor flux in the 
data, i.e. the 'large p, small n' issue (Jannink et al., 2010). 
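
The 'large p, small n' issue can be illustrated with ridge regression, one of the prediction methods listed in Table 2. The sketch below is only an illustration under stated assumptions: the genotypes and phenotypes are simulated, the penalty `lam` is an arbitrary value rather than an estimated one, and the function name is hypothetical. With more markers than individuals, the least-squares normal matrix is rank deficient, so marker effects cannot be estimated uniquely without a penalty.

```python
import numpy as np


def ridge_marker_effects(X, y, lam):
    """Solve (X'X + lam * I) b = X'y for the vector of marker effects b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Simulated example (assumption): 50 individuals, 500 bi-allelic markers.
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.integers(0, 2, size=(n, p)).astype(float)
X -= X.mean(axis=0)                      # centre each marker column
true_effects = np.zeros(p)
true_effects[:10] = rng.normal(size=10)  # only 10 markers carry an effect
y = X @ true_effects + rng.normal(scale=0.5, size=n)

# X'X is a p x p matrix but its rank cannot exceed n, so it is singular.
rank = np.linalg.matrix_rank(X.T @ X)

# The ridge penalty makes the system solvable and shrinks all effects.
lam = 10.0
effects = ridge_marker_effects(X, y, lam)
gebv = X @ effects                       # GEBVs of the training individuals
```

In practice the penalty is usually chosen from the data, for example by cross-validation or from variance component estimates, rather than fixed in advance as it is here.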

The empirical studies indicated that the sizes of training 
populations in plant GS studies were often smaller than 
those of animal studies (Tables 1 and 2). Two factors are 
Fig. 4. Allele types of flanking markers for a targeted gene. Roman numerals
represent the markers (I, II, IV and V) mapped on a linkage group. 'G' indicates
a targeted gene. Distances between adjacent markers and the gene G
are 20 cM. White and black represent the allele types of the markers, while
grey and yellow indicate the allele types of a targeted gene. Suppose that
the yellow allele is a favourable genotype on a targeted gene G. The LD
between gene G, marker II and marker IV is completely random in a pool
of breeding materials (unrelated population) while significant LD (r² = 0.8)
is observed in RILs developed from biparental crosses (1 and 2), as shown
in Fig. 3. When the two individuals outlined in red are selected for a biparental
cross (B: cross 1), the genotypes of the flanking markers (II and IV) linked to
gene G/yellow are white. By contrast, when the two individuals outlined in
blue are selected for a biparental cross (C: cross 2), the genotypes of the
flanking markers (II and IV) linked to gene G/yellow are black. This example
indicates that the allele types with significant LD with the targeted genes are
different between the two crosses.



expected to affect the size differences of training populations.
The first factor is the narrow genetic diversity of plant
populations, which is mainly caused by self-pollination
and/or the small number of parental lines used for generating
tested populations (biparental crosses have often been used).
Because populations with greater genetic diversity require
larger population sizes to obtain GEBVs with high accuracy
(Mujibi et al., 2011), smaller training populations are used
in plant GS studies, especially for self-pollinating species
and/or populations derived from biparental crosses.
The second factor is the existence of a large quantity of
legacy data about the phenotypes of pedigrees, which have
been used to estimate traditional BVs in animal breeding
(Hayes et al., 2009). The accumulated phenotype data should
make it possible to perform GS studies at low cost. As with
animal studies, pooling phenotypes of plant populations that
have been investigated in multiple regions would be a
promising approach for achieving success in plant GS
studies, satisfying both high-accuracy GEBV and low
experimental cost.



Generally, a greater number of markers is required for a
population where the marker intervals with a significant LD
intensity are shorter. In addition, empirical and simulation
studies suggest that a larger number of markers improves the
accuracy of GEBVs. For example, Solberg et al. (2008)
found that the simulated accuracy of GEBVs was improved
by increasing the marker density from 0.25 to 8 SNP
markers per centimorgan in 100 unrelated animals.
Furthermore, in an inbred population derived from a biparental
cross, Bernardo and Yu (2007) demonstrated that the response
of the GEBV improved when the adjacent marker intervals
were decreased from 28.0 to 7.0 cM, whereas no differences were
observed among marker intervals of 7.0, 3.5 and 2.3 cM when
the total length of the linkage map was assumed to be 1794
cM. The heritability of targeted traits also affects the
relationship between the density of markers and the accuracy
of GEBV. Calus and Veerkamp (2007) demonstrated that an
adjacent marker r² of 0.15 was sufficient for a trait with a
heritability of 50 %, while the GEBV accuracy was improved by
increasing the r² to 0.2 for a trait with a heritability of 10 %.
However, we have to consider that too many markers often
lead to a loss in GEBV accuracy, as described in the
section on population size.
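
Whether a marker panel reaches an adjacent-marker r² threshold such as those cited above can be checked directly from phased genotype data. The following is a minimal sketch, assuming bi-allelic markers coded 0/1 on phased haplotypes and markers ordered by map position; the function names and the toy data are hypothetical.

```python
import numpy as np


def ld_r2(a, b):
    """r^2 between two bi-allelic markers coded 0/1 on phased haplotypes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pa, pb = a.mean(), b.mean()
    if pa in (0.0, 1.0) or pb in (0.0, 1.0):
        raise ValueError("r^2 is not defined for a monomorphic marker")
    d = (a * b).mean() - pa * pb          # coefficient of disequilibrium D
    return d ** 2 / (pa * (1 - pa) * pb * (1 - pb))


def adjacent_r2(haplotypes):
    """r^2 for each pair of neighbouring markers (columns in map order)."""
    h = np.asarray(haplotypes)
    return [ld_r2(h[:, j], h[:, j + 1]) for j in range(h.shape[1] - 1)]


# Toy data: 4 haplotypes x 3 markers. Markers 1 and 2 always co-occur,
# while marker 3 is independent of marker 2.
haplotypes = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
r2_values = adjacent_r2(haplotypes)       # [1.0, 0.0]
panel_is_dense_enough = min(r2_values) >= 0.15
```

Here the second marker pair falls below the threshold, so a denser panel would be needed in that interval.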

One of the obvious differences between GS and traditional 
MAS is the number of markers required for genotyping in a 
breeding population. In most GS studies, the whole set of 
markers used in the training population was also applied to 
the breeding or validating population. For example, suppose 
that the numbers of individuals in the training and breeding 
populations are 200 and 1000, respectively, and that the 
number of genotyped markers is 1000, then the genotype 
data points are 200 x 1000 in the training population and 
1000 x 1000 in the breeding population. This is quite different 
from traditional MAS, which requires a few selected marker 
genotypes that are related to a targeted trait in the breeding 
phase, except when investigating the GW genetic backgrounds 
of a breeding population in MARS. 
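
The genotyping load in this example is simply the number of individuals multiplied by the number of markers. The sketch below reproduces the two counts from the text and, for contrast, adds a traditional MAS case; the five-marker MAS panel is a hypothetical figure chosen only for illustration.

```python
def genotype_data_points(n_individuals: int, n_markers: int) -> int:
    """Number of marker genotypes that must be scored for a population."""
    return n_individuals * n_markers


# GS: the same 1000 markers are scored in both populations.
gs_training = genotype_data_points(200, 1000)    # 200 000
gs_breeding = genotype_data_points(1000, 1000)   # 1 000 000

# Traditional MAS in the breeding phase, assuming 5 selected markers.
mas_breeding = genotype_data_points(1000, 5)     # 5 000

print(gs_training, gs_breeding, mas_breeding)
```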

Several reports suggest that advances in genotyping tech- 
nologies will resolve the cost issue of the large number of 
genotype data points required in a breeding population and 
this idea might be correct. However, it is still necessary to 
conduct large-scale genotyping when performing GS in 
many breeding programmes, especially for non-major crops. 
To overcome this obstacle, several studies have examined ways of
reducing the number of genotyped markers. For example, Habier
et al. (2009) proposed a panel of evenly spaced low-density 
SNPs for tracking the effects of high-density SNP alleles 
within families based on the utilization of cosegregation infor- 
mation. Iwata and Jannink (2010) determined the imputation 
scores of untyped markers in a low-density genotyped panel 
by referencing a high-density panel in barley. Both studies 
were based on a common idea of predicting the interval geno- 
types of a population using low-density allelic data. By con- 
trast, Cleveland et al. (2010) discussed the performance of 
GEBV prediction by reducing the density of marker panels. 
It was found that low-density and evenly spaced SNPs per- 
formed poorly when predicting GEBV, whereas SNPs selected 
based on their additive-effect size yielded accuracies similar to
those at a high density. 
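
The two ways of building a low-density panel compared above can be sketched as follows. This is an illustration only, not the procedure of Cleveland et al. (2010): the function names are hypothetical, the effect estimates are made-up numbers, and markers are assumed to be indexed in map order.

```python
import numpy as np


def evenly_spaced_panel(n_markers, panel_size):
    """Indices of markers spread evenly along the ordered marker list."""
    positions = np.linspace(0, n_markers - 1, panel_size)
    return np.unique(np.round(positions).astype(int))


def effect_ranked_panel(effects, panel_size):
    """Indices of the markers with the largest absolute estimated effects."""
    order = np.argsort(-np.abs(np.asarray(effects, dtype=float)), kind="stable")
    return np.sort(order[:panel_size])


# Made-up additive effect estimates for 10 markers from a high-density panel.
effects = np.array([0.1, -2.0, 0.05, 0.0, 1.5, -0.2, 0.0, 0.3, -0.9, 0.01])

even_panel = evenly_spaced_panel(len(effects), 4)   # markers 0, 3, 6, 9
ranked_panel = effect_ranked_panel(effects, 4)      # markers 1, 4, 7, 8
```

In this toy case the evenly spaced panel misses every large-effect marker, whereas the ranked panel keeps them. A ranked panel is specific to the trait whose effects were used to build it.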

Heffner et al. (2009) proposed a model for a genomic selec- 
tion breeding programme, which consisted of a model training 
cycle and a line development cycle. They suggested that the most
immediate impact of selecting an elite line by GEBVs would 
be a marked increase in the speed of the cycles. Shorter selec- 
tion cycles of populations would lead to a rapid change of 
genetic diversity in the breeding populations and would 
affect GEBV accuracy during long-term selection. In addition, 
novel recombinants generated during selection cycles would 
cause LD decay between markers and QTLs. This would be 
a more serious issue when lower-density markers are used 
for GEBV predictions. Goddard (2009) and Jannink et al. 
(2010) surveyed the dynamics of long-term selection responses 
by performing simulation studies, and concluded that GS leads 
to a more rapid decline in the selection response than PS 
unless new markers are continually added to the prediction 
of breeding value. They also suggested that placing additional 
weight on low-frequency favourable alleles, especially at the 
beginning of GS, was important for maximizing the long-term 
response in GS. 



Types of markers 

Most GS studies use SNP, DArT and simple sequence repeat
(SSR) markers for genotyping. Results based on other types of
markers in high-throughput genotyping systems, such as
restriction site-associated DNA (RAD) (Miller et al., 2007) and
genotyping by sequencing (GBS) (Elshire et al., 2011) marker
systems, will be reported in the near future. The DArT,
RAD and GBS markers identify polymorphisms by hybridization
or sequencing of DNA digested with restriction enzymes,
so they are dominant markers except where high-coverage
genome data are obtained for each individual using
RAD and GBS markers. Li et al. (2007) showed that the LD
detection power of a dominant marker is less than that of a
co-dominant marker, and that it was improved with a three-locus
LD analysis. These results suggest that dominant markers lead
to a lower accuracy of GEBV prediction than co-dominant
markers and that employing haplotypes would improve the
accuracy.

DNA markers are also categorized as bi-allelic markers and
multi-allelic markers. The former include SNP, DArT, GBS
and RAD markers while the latter include SSR, RAPD
(random amplified polymorphic DNA) and RFLP (restriction
fragment length polymorphism) markers. Solberg et al.
(2008) compared the accuracy of GEBV prediction with
SNP and SSR markers using 100 unrelated animals, and
concluded that SNP markers required a density two to three
times greater than that of SSR markers to
achieve similar accuracy. With bi-allelic markers, additional
consideration must be given to the genetic sources used in
marker development. Barendse et al. (2009) compared
Australian and Bovine HapMap samples, and found differences
in the presumptive selective signatures when different
breeds or SNPs were used. Based on these results, they
suggested that using the same SNPs is necessary when comparing
the selection signatures among studies.



RAD and GBS marker systems, which can scan GW polymorphisms
de novo, would bypass the need for prior
marker development and allow direct genotyping of
the training and breeding populations.



Traits 

The main advantage of MAS is considered to be the lack of 
a requirement for phenotyping during selection cycles 
(Heffner et al., 2009). More strictly, there is no need for the 
phenotyping of traits that were previously investigated in a 
training population. In conventional breeding, multiple 
expressed traits are investigated during a whole growing 
period and after harvesting. Thus, all traits of interest to bree- 
ders should be investigated during the training phase to 
exclude phenotyping during the breeding cycle if the gain of 
'selection' is regarded as equivalent between MAS and con- 
ventional breeding. In traditional MAS, only a few selected 
markers are used during the breeding phase, whereas GW gen- 
otypes are used in GS. Therefore, MAS for multiple traits is 
performed more systematically in GS, because there is no 
need to change the marker set used during the breeding 
phase. To our knowledge, no reports have been published on
the selection of traits where trade-off relationships are
observed in breeding materials, such as stress tolerance and
quality. However, we consider that the selection of trade-off
traits in GS will be a major issue in the near future, because the
use of GW genotypes might help break up the trade-off
relationships of targeted traits, although current GS does not
consider the weight of each marker effect in the result.



Breeding scheme with GS 

According to published reports, GS is not assumed to be a 
perfect replacement for PS in plant breeding and instead it is 
proposed as a method for accelerating part of a whole breeding 
programme. For example, Bernardo and Yu (2007) proposed 
using GS during the off-season for the selection of random 
mating DHLs that are pre-selected for their test-crossing 
ability in the regular season by PS. By contrast, Heffner 
et al. (2009, 2010) and Jannink (2010) proposed using GS 
for parental selection to generate the breeding population in 
the next selection cycle. For example, 288 inbred (F5) lines
of winter wheat were assumed to be created by single-seed
descent and genotyped (Heffner et al., 2010). F5-derived
lines were grown in the field to increase the seeds, and
were then selected for advanced testing based on their
phenotypes and GEBVs. Small numbers of F5-derived lines were
selected based on GEBV to start recombining for the next
cycle. In addition, phenotypic data from F5-derived lines
were used for GS modelling of the next cycle. The proposed
scheme suggested that GS fits well with recurrent selection 
approaches that are not usually employed in the conventional 
breeding of selfing crop species. Interestingly, Heffner et al. 
(2010) also proposed using traditional MAS for important 
QTLs in the F2 and F3 generations, before GS in the F5 gener- 
ation. This eliminates unnecessary marker scoring and green- 
house space for lines that do not carry essential QTL alleles. 
These propositions suggest the importance of flexible GS 
introduction into breeding programmes and combining it with
other approaches, i.e. traditional MAS and PS. 

Computer package for GS modelling 

An R package for GS is available at http://www.r-project.org/.
However, no user-friendly software comparable to QTL
Cartographer (Wang et al., 2011) or MapQTL (Van Ooijen,
2004), which are used in QTL analysis, has yet been developed.
The development of a user-friendly software package is required
to enhance the general application of GS.

FUTURE PERSPECTIVES IN GS 

Like conventional breeding and traditional MAS, GS cannot
be used for the selection of traits with low narrow-sense
heritability. Narrow-sense heritability is defined as the ratio
of the variance of additive genetic effects to the phenotypic
variance. Thus, low heritability arises when the variance from
sources other than additive genetic effects is high; these
sources include environmental factors, G x E interactions, and
dominant and epistatic genetic effects.
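
The definition above can be made concrete with a small calculation. The sketch assumes that the phenotypic variance is the plain sum of the listed components, with no covariance between them; the function name and the variance values are hypothetical.

```python
def narrow_sense_heritability(var_additive, var_dominance=0.0,
                              var_epistasis=0.0, var_environment=0.0,
                              var_gxe=0.0):
    """Additive genetic variance divided by total phenotypic variance.

    Assumes the components are independent, so they simply add up.
    """
    var_phenotypic = (var_additive + var_dominance + var_epistasis
                      + var_environment + var_gxe)
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_additive / var_phenotypic


# Made-up variance components: the additive variance is the same in both
# cases, but a larger environmental variance lowers the heritability.
h2_quiet_environment = narrow_sense_heritability(2.0, 1.0, 0.5, 0.5, 0.0)
h2_noisy_environment = narrow_sense_heritability(2.0, 1.0, 0.5, 4.0, 0.5)
print(h2_quiet_environment, h2_noisy_environment)   # 0.5 0.25
```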

Previous studies of QTL identification suggest that the mag- 
nitudes of G x E are unequal, with some QTLs expressed in 
all tested environments and others expressed in a particular en- 
vironment (Xu and Crouch, 2008). With GS, Goddard and 
Hayes (2007) indicated that different animals tend to be 
selected for the two environments when the genetic correlation 
between production in the two environments is <0.8. The only
solution for overcoming the G x E issue is considered to be 
the accumulation and comparison of phenotypes investigated 
in different environments. Recently, Thomas (2010) reviewed 
potent approaches for gene-environment-wide association 
studies, i.e. mining GW association data for G x E interac- 
tions. For example, in an approach known as two-phase 
case-control, the whole case-control dataset was divided into 
subgroups based on different categories, such as age and 
gender, before correlations between SNPs and traits in the sub- 
groups were identified. The approaches are different from GS 
in plants and they cannot be applied directly, but the essence of 
the idea, i.e. dividing groups based on environmental condi- 
tions, might be applicable in the future development of GS 
methodology. Heffner et al. (2009) also suggested that GS is 
feasible for phenotypic data accumulation, because once
phenotypes have been evaluated in particular environmental
conditions (e.g. a severe winter once per decade), the phenotypic
values can be included in the GS model and used for selection. 
This idea is better suited to crops where the genetic structures 
of breeding materials are relatively fixed, and that have been 
investigated for a long time. In addition, careful consideration 
should be given to phenotypic values investigated under 
extreme environmental conditions. If such phenotypic values 
are mixed with other values investigated under normal condi- 
tions and then used for GS modelling, the GEBV accuracy will 
be decreased. The comparison of phenotypic values evaluated 
in different environmental conditions has not yet matured and 
further study will be needed. 

GEBVs are predicted based on additive genetic effects, so
current GS considers neither dominant genetic effects
nor epistasis. In traditional MAS, the dominant effects are



considered because interval mapping and LD mapping can 
predict the dominant effect of QTLs. Thus, the consideration 
of dominant effects in GS modelling is expected to be 
achieved in the near future by improving modelling algo- 
rithms. By contrast, the consideration of epistasis in GS is 
more challenging. Previous studies indicate that the magnitude 
of epistatic effects depends on the species, population structure 
and targeted traits, although sometimes it is negligible (Xu and 
Jia, 2007) whereas in other cases it is more important than the 
additive effects (Malmberg et al., 2005; Mei et al., 2005;
Dudley and Johnson, 2009). Isobe et al. (2007) developed a 
QTL mapping approach that searches for QTL interactions in 
genetic variation and they demonstrated that QTL interactions 
among small effect QTLs sometimes produce larger effects 
than single main-effect QTLs. Recently, Hu et al. (2011) 
used an empirical Bayesian method for GEBV prediction 
that had been used previously for the identification of epistatic 
QTLs by Xu and Jia (2007). The results showed that including 
epistatic effects greatly increased the accuracy of GEBV pre- 
diction compared with the non-consideration of epistatic 
effects. Epistasis demands vast amounts of calculation for its 
identification, but the consideration of epistatic effects in GS 
is an issue that needs to be addressed in the future. 
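
The computational burden of epistasis can be seen from the number of marker pairs alone. The sketch below counts only two-way interactions and ignores higher-order terms; the function name is hypothetical, and the marker numbers are taken from the worked example and Table 2 in this review.

```python
import math


def pairwise_interaction_terms(n_markers: int) -> int:
    """Number of distinct two-marker combinations among n markers."""
    return math.comb(n_markers, 2)


# 1000 markers, as in the genotyping example used earlier in this review.
pairs_1000 = pairwise_interaction_terms(1000)   # 499 500

# 3938 SNPs, the largest Loblolly pine panel in Table 2.
pairs_3938 = pairwise_interaction_terms(3938)   # 7 751 953

print(pairs_1000, pairs_3938)
```

The number of pairs grows roughly with the square of the marker number, so even a modest panel produces far more candidate interaction terms than there are individuals in a typical training population.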

In conclusion, GS may be regarded as a potent, attractive 
and valuable approach for plant breeding. The main contribu- 
tion of GS to breeding MAS might be in providing a concept 
for the conversion of genotypic value to phenotypic value. 
With this idea, we are free from the pyramiding of QTLs 
and we can enjoy designing the ideal genotype based on the 
results of one or a few test trials. At the same time, GS is 
not a perfect method and several issues demand careful
attention and improvement. Moreover, as van der Werf (2007)
suggested, GS leaves the underlying biology
in a black box. In our opinion, the main weakness of
current GS might be that it does not make use of the biological
context of genome sequences. Current GS algorithms have not been connected
with previous and current studies of genetics and genomics, 
such as QTLs and (candidate) gene identification. By integrat- 
ing the essence of GS with other fields of genetics and 
genomic studies, it might be possible to escape the black 
box. Meuwissen (2007) noted that GS was considered a 
crazy idea when he and his colleagues proposed it. However, 
it has now become a realistic approach in plant breeding and 
we have validated its availability with empirical data. GS is 
not a final solution of MAS, but it is a turning point on the 
road that leads us to the next phase of MAS. GS will be inte- 
grated into many practical breeding programmes in the near 
future as it becomes more advanced and its theory matures. 